Risk-Averse Allocation Indices for Multiarmed Bandit Problem
Authors
Abstract
In the classical multiarmed bandit problem, the aim is to find a policy maximizing the expected total reward, implicitly assuming that the decision-maker is risk-neutral. On the other hand, decision-makers are risk-averse in some real-life applications. In this article, we design a new setting based on the concept of dynamic risk measures, where the aim is to find a policy with the best risk-adjusted total discounted outcome. We provide a theoretical analysis of the problem with respect to this novel setting and propose a priority-index heuristic that gives risk-averse allocation indices with a structure similar to the Gittins index. Although an optimal policy is shown not always to have an index-based form, empirical results demonstrate the excellence of this heuristic and show that it can achieve optimal or near-optimal interpretable policies.
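The abstract names the construction only at a high level. Purely as an illustration, the Python sketch below scores each arm with a Gittins-style index in which the expected reward is replaced by a mean-semideviation risk adjustment; the function `risk_adjusted_index`, the `risk_coef` parameter, and the use of lower semideviation are assumptions made for this sketch, not the paper's dynamic risk measure.

```python
import numpy as np

def risk_adjusted_index(samples, discount=0.9, risk_coef=0.5):
    """Hypothetical risk-averse index for one arm.

    Replaces the expected reward in a Gittins-style index with a
    mean-semideviation adjustment: mean - risk_coef * E[(mean - X)^+].
    This is a stand-in for the paper's dynamic risk measure, not its definition.
    """
    mean = samples.mean()
    downside = np.maximum(mean - samples, 0.0).mean()  # lower semideviation
    return (mean - risk_coef * downside) / (1.0 - discount)

rng = np.random.default_rng(0)
# Three hypothetical arms with similar means but different reward spreads.
arms = [rng.normal(1.0, sigma, size=10_000) for sigma in (0.1, 0.5, 2.0)]
indices = [risk_adjusted_index(a) for a in arms]
best = int(np.argmax(indices))  # a risk-averse policy pulls the highest index
print(indices, "-> pull arm", best)
```

With `risk_coef=0` the ordering reduces to the risk-neutral expected-reward case, which is the sense in which the sketch mirrors the Gittins-style structure described above.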
Similar resources
The Irrevocable Multiarmed Bandit Problem
This paper considers the multi-armed bandit problem with multiple simultaneous arm pulls and the additional restriction that we do not allow recourse to arms that were pulled at some point in the past but then discarded. This additional restriction is highly desirable from an operational perspective and we refer to this problem as the ‘Irrevocable Multi-Armed Bandit’ problem. We observe that na...
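To make the irrevocability restriction concrete, here is a minimal simulation assuming Bernoulli arms examined in sequence and a naive keep-or-discard rule based on the empirical mean; the threshold rule is invented for illustration and is not the policy studied in the paper.

```python
import random

def irrevocable_run(true_means, pulls_per_arm=50, keep_threshold=0.5, seed=1):
    """Pull arms in sequence; once an arm is discarded it may never be revisited.

    The keep-or-discard rule here (empirical mean vs. a fixed threshold) is
    purely illustrative; the cited paper studies far more refined policies.
    """
    random.seed(seed)
    total = 0
    kept = None
    for mean in true_means:          # arms are considered one at a time
        rewards = [random.random() < mean for _ in range(pulls_per_arm)]
        total += sum(rewards)
        if sum(rewards) / pulls_per_arm >= keep_threshold:
            kept = mean              # commit to this arm ...
            break                    # ... all earlier arms are irrevocably gone
    return total, kept

print(irrevocable_run([0.3, 0.6, 0.8]))
```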
The Nonstochastic Multiarmed Bandit Problem
In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff...
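The excerpt ends before the authors' algorithm is described. For the adversarial setting this abstract refers to, the usual reference point is Exp3 (exponential weights with uniform exploration); the sketch below is a standard textbook version and is not claimed to match the exact variant analyzed in the paper.

```python
import math, random

def exp3(reward_fn, K, T, gamma=0.1, seed=0):
    """Exp3: exponential-weight algorithm for the adversarial K-armed bandit.

    gamma mixes in uniform exploration; rewards are assumed to lie in [0, 1].
    """
    random.seed(seed)
    weights = [1.0] * K
    total = 0.0
    for t in range(T):
        wsum = sum(weights)
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        reward = reward_fn(arm, t)          # observed only for the pulled arm
        total += reward
        est = reward / probs[arm]           # importance-weighted estimate
        weights[arm] *= math.exp(gamma * est / K)
        if max(weights) > 1e300:            # guard against float overflow
            weights = [w / 1e300 for w in weights]
    return total

# Example: arm 2 pays slightly more on average (stochastic stand-in
# for an adversarial reward sequence).
payout = lambda arm, t: float(random.random() < (0.4 if arm != 2 else 0.6))
print(exp3(payout, K=3, T=5000))
```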
A Lemma on the Multiarmed Bandit Problem
We prove a lemma on the optimal value function for the multiarmed bandit problem which provides a simple, direct proof of the optimality of write-off policies. This, in turn, leads to a new proof of the optimality of the index rule.
Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem with Multiple Plays, Part II: Markovian Rewards
At each instant of time we are required to sample a fixed number m ≥ 1 out of N Markov chains whose stationary transition probability matrices belong to a family suitably parameterized by a real number θ. The objective is to maximize the long run expected value of the samples. The learning loss of a sampling scheme corresponding to a parameter configuration C = (θ1, ..., θN) is quantified...
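As a toy illustration of the multiple-plays setting, the sketch below pulls the m arms with the highest empirical means at each step; this naive rule is a stand-in only, not the asymptotically efficient allocation rules the paper constructs, and it uses i.i.d. rewards in place of Markovian ones.

```python
import numpy as np

def top_m_sample_mean(means, m=2, T=1000, seed=0):
    """Toy multiple-plays rule: each step, pull the m arms with the highest
    empirical means, after a round-robin initialization phase.

    A naive stand-in for the asymptotically efficient allocation rules
    constructed in the cited paper.
    """
    rng = np.random.default_rng(seed)
    N = len(means)
    counts = np.zeros(N)
    sums = np.zeros(N)
    for t in range(T):
        if t < N:                            # round-robin initialization
            chosen = [(t + i) % N for i in range(m)]
        else:
            chosen = np.argsort(sums / counts)[-m:]
        for a in chosen:
            r = rng.normal(means[a], 1.0)    # i.i.d. stand-in for Markovian rewards
            counts[a] += 1
            sums[a] += r
    return sums.sum()

print(top_m_sample_mean([0.1, 0.5, 0.9, 1.2], m=2))
```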
Finite-Time Regret Bounds for the Multiarmed Bandit Problem
We show finite-time regret bounds for the multiarmed bandit problem under the assumption that all rewards come from a bounded and fixed range. Our regret bounds after any number T of pulls are of the form a + b log T + c log² T, where a, b, and c are positive constants not depending on T. These bounds are shown to hold for variants of the popular ε-greedy and Boltzmann allocation rules, and for a ...
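Both rule families named in that abstract are short to state. A minimal sketch of ε-greedy and Boltzmann (softmax) allocation on stationary Bernoulli arms follows, with hand-picked ε and temperature; the constants a, b, and c in the bound depend on such tuning.

```python
import math, random

def pull(q, counts, explore):
    """One allocation step: `explore` picks the arm, the empirical mean updates."""
    arm = explore(q)
    r = float(random.random() < TRUE_MEANS[arm])   # Bernoulli reward
    counts[arm] += 1
    q[arm] += (r - q[arm]) / counts[arm]           # incremental mean
    return r

def eps_greedy(q, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(q))            # explore uniformly
    return max(range(len(q)), key=q.__getitem__)   # exploit best estimate

def boltzmann(q, temp=0.2):
    w = [math.exp(v / temp) for v in q]            # softmax over estimates
    return random.choices(range(len(q)), weights=w)[0]

TRUE_MEANS = [0.3, 0.5, 0.7]
random.seed(0)
for explore in (eps_greedy, boltzmann):
    q, counts = [0.0] * 3, [0] * 3
    total = sum(pull(q, counts, explore) for _ in range(5000))
    print(explore.__name__, total)
```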
Journal
Journal title: IEEE Transactions on Automatic Control
Year: 2021
ISSN: 0018-9286, 1558-2523, 2334-3303
DOI: https://doi.org/10.1109/tac.2021.3053539